This is one page of the R Handbook for Epidemiologists, but is being printed as a stand-alone page.

You can find the complete handbook on Github

Epidemic curves

Overview

Preparation

Preparation

Packages

This code chunk shows the loading of packages required for the analyses.

pacman::p_load(rio,          # File import
               here,         # File locator
               tidyverse,    # data management + ggplot2 graphics
               lubridate,    # working with  dates    
               aweek,        # alternative package for working with dates
               incidence,    # an option for epicurves of linelist data
               stringr,      # Search and manipulate character strings
               forcats,      # working with factors
               RColorBrewer) # Color palettes from colorbrewer2.org

Load data

Two example datasets are used in this section:

  • Linelist of individual cases from a simulated epidemic
  • Aggregated counts by hospital from the same simulated epidemic

The dataset is imported using the import() function from the rio package. See the page on importing data for various ways to import data. The linelist and aggregated versions of the data are displayed below.

For most of this document, the linelist dataset will be used. The aggregated counts dataset will be used at the end.

# import the linelist
linelist <- rio::import("linelist_cleaned.xlsx")

Review the two datasets and notice the differences

Case linelist

The first 50 rows are displayed

Case counts aggregated by hospital

The first 50 rows are displayed

Set parameters

You may want to set certain parameters for production of a report, such as the date for which the data is current (the “data date”). You can then reference the data_date in the code when applying filters or in captions that auto-update.

## set the report date for the report
## note: can be set to Sys.Date() for the current date
data_date <- as.Date("2015-05-15")

Verify dates

Verify that each relevant date column is class Date and has an appropriate range of values. This for loop prints a histogram for each column.

# create character vector of column names 
DateCols <- as.character(tidyselect::vars_select(names(linelist), matches("date|Date|dt")))

# Produce histogram of each date column
for (Col in DateCols) {     # open loop. iterate for each name in vector DateCols
  hist(linelist[, Col],     # print histogram of the column in linelist dataframe
       breaks = 50,         # number of breaks for the histogram
       xlab = Col)          # x-axis label is the name of the column
  }                         # close the loop

incidence package

incidence package

Below are tabs on making quick epicurves using the incidence package

CAUTION: Epicontacts expects data to be in a “linelist” format of one row per case (not aggregated). If your data is aggregated counts, look to the ggplot epicurves tab.

TIP: The documentation for plotting an incidence object can be accessed by entering ?plot.incidence in your R console.

https://cran.r-project.org/web/packages/incidence/vignettes/customize_plot.html#example-data-simulated-ebola-outbreak

Intro

Intro

2 steps are requires to plot an epicurve with the incidence package:

  1. Create an incidence object (using the function incidence())
    • Provide the case linelist
    • Specify the time interval into which the cases should be aggregated (daily, weekly, monthly..)
    • Specify any sub-groups
  2. Plot the incidence object
    • Specify labels, aesthetic themes, etc.

A simple example - an epicurve of daily cases:

# load incidence package
library(incidence)

# create the incidence object using data by day
epi_day   <- incidence(linelist$date_onset,  # the linelist data
                       interval = "day")     # the time interval

# plot the incidence object
plot(epi_day)

Change time interval of case aggregation (bars)

The interval argument defines how the observations are grouped. Available options include all the options from the package aweek, including but not limited to:

  • “week” (Monday start day is default)
  • “2 weeks” (or 3, 4, 5…)
  • “Sunday week”
  • “2 Sunday weeks” (or 3, 4, 5…)
  • “MMWRweek” (starts on Sunday - see US CDC)
  • “month” (1st of month)
  • “quarter” (1st of month of quarter)
  • “2 months” (or 3, 4, 5…)
  • “year” (1st day of calendar year)

Below are examples of how different intervals look when applied to the linelist.
Format and frequency of the date labels on the x-axis are the defaults for the specified interval.

# Create the incidence objects (with different intervals)
##############################
# Weekly (Monday week by default)
epi_wk      <- incidence(linelist$date_onset, interval = "Monday week")

# Sunday week
epi_Sun_wk  <- incidence(linelist$date_onset, interval = "Sunday week")

# Three weeks (Monday weeks by default)
epi_3wk     <- incidence(linelist$date_onset, interval = "3 weeks")

# Monthly
epi_month   <- incidence(linelist$date_onset, interval = "month")

# Plot the incidence objects (+ titles for clarity)
############################
plot(epi_wk)+     labs(title = "Monday weeks")
plot(epi_Sun_wk)+ labs(title = "Sunday weeks")
plot(epi_3wk)+    labs(title = "Every 3 Monday weeks")
plot(epi_month)+  labs(title = "Months")

Modifications

Modifications

The incidence package enables modifications in the following ways:

  • Arguments of plot() (e.g. show_cases, col_pal, alpha…)
  • scale_x_incidence() and make_labels()
  • ggplot() additions via the + operator

Read details in the Help files by entering ?scale_x_incidence and ?plot.incidence in the R console. Online vignettes are listed in the resources tab.

plot() modifications

plot() modifications

A incidence plot can be modified in the following ways. Type ?plot.incidence in the R console for more details.

  • show_cases = If TRUE, each case is shows as a box. Best on smaller outbreaks.
  • color = Color of case bars/boxes
  • border = Color of line around boxes, if show_cases = TRUE
  • alpha = Transparency of case bars/boxes (1 is fully opaque, 0 is fully transparent)
  • xlab = Title of x-axis (axis labels can also be applied using labs() from ggplot)
  • ylab = Title of y-axis; defaults to user-defined incidence time interval
  • labels_week = Logical, indicate whether x-axis labels are in week or date format, absent other modifications
  • n_breaks = Number of x-axis label breaks, absent other modifications
  • first_date, last_date Dates used to trim the plot

See examples of these arguments in the subsequent tabs.

Filtered data

Filtered data

To plot the epicurve of a subset of data:

  1. Filter the linelist data
  2. Feed the subset to the incidence() command

The example below uses data filtered to show only cases at Central Hospital.

# filter the dataset
central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object using subset of data
central_outbreak <- incidence(central_data$date_onset, interval = "week")

# plot
plot(central_outbreak) + labs(title = "Weekly case incidence at Central Hospital")

Date-axis labels/gridlines

Date-axis labels/gridlines

TIP: Remember that date-axis labels are independent from the aggregation of the data into bars

Modify the bars
The aggregation of data into bars occurs when you set the interval = when creating the incidence object. The options for interval come from the package aweek and include options like “day”, “Monday week”, “Sunday week”, “month”, “2 weeks”, etc. See the incidence intro tab for more information.

Modify date-axis labels (frequency & format)

If working with the incidence package, you have several options to make these modifications. Some utilize the incidence package functions scale_x_date() and make_breaks(), others use the ggplot2 function scale_x_date(), and others use a combination.

DANGER: Be cautious setting the y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static numbers can cut-off your data if the data changes!.

Option 1: scale_x_incidence() only
  1. Add scale_x_incidence() from the incidence package:
    • Why use this approach?
      • Advantages: Short code. Auto-adjusts weekly labels to interval of incidence object (Monday, Sunday weeks, etc.)
      • Disadvantages: Cannot make fine adjustments to label format or minor vertical gridlines between labels
    • Provide the name of the incidence object to ensure labels align with specified interval (e.g. Sundays or Mondays)
    • optional: n_breaks specify number of date labels, which start from the interval of the first case.
      • for breaks every nth week, use n_breaks = nrow(i)/n (“i” is the incidence object name and “n” is a number)
    • optional: labels_week labels formatted as either weeks (YYYY-Www) or dates (YYYY-MM-DD)
    • One vertical gridline will appear per date label

Other notes:

  • Type ?scale_x_incidence into the R console to see more information.
  • If incidence interval is “month”, n_breaks and labels_week will behave differently
  • Adding scale_x_date() to the plot will remove labels created by scale_x_incidence
  • Note in plot below that the first label is 27 April 2014, the Sunday before the first case (May 1), aligning with Sunday weeks of the incidence object.
# create weekly incidence object (Sunday weeks)
i <- incidence(central_data$date_onset, interval = "Sunday week")

plot(i)+
  scale_x_incidence(i,                    # name of incidence object
                    labels_week = F,      # show dates instead of weeks
                    n_breaks = nrow(i)/8) # breaks every 8 weeks from week of first case

Option 2: scale_x_date() and make_breaks()
  1. Add scale_x_date() from ggplot2, but also leverage make_breaks() from incidence:
    • Why use this approach?
      • Advantages: Best of both worlds: weekly labels auto-aligned to incidence interval, and you can make detailed adjustments to label format
      • Disadvantages: If minor gridlines between Sunday-week date labels are desired, they are not auto-aligned
    • After creating the incidence object, use make_breaks() to define date label breaks
      • make_breaks() is similar to scale_x_incidence() (described above). Provide the incidence object name and optionally n_breaks as described before.
    • Add scale_x_date() to the plot:
      • breaks = provide the breaks vector you created with make_breaks(), followed by $breaks (see example below)
      • date_labels = provide a format for the date labels (e.g. “%d %b”) (use “” for new line)
# Break modification using scale_x_date() and make_breaks()
###########################################################
# make incidence object
i <- incidence(central_data$date_onset, interval = "Monday week")

# make breaks
i_labels <-  make_breaks(i, n_breaks = nrow(i)/6) # using interval from i, breaks every 6 weeks

# plot
plot(i)+
  scale_x_date(breaks      = i_labels$breaks, # call the breaks
               date_labels = "%d\n%b '%y",    # date format
               date_minor_breaks = "weeks")   # gridlines each week (aligns with Sundays only)  

Option 3: Use scale_x_date() only
  1. Use scale_x_date() only
    • Advantages: Complete control over breaks, labels, gridlines, and plot width
    • Disadvantages: More code required, more opportunity to make mistakes.
    • If your incidence intervals are days or Monday weeks, (easy!):
      • Provide interval to date_breaks = (e.g. “day”, “week”, “2 weeks”, “month”, “year”)
      • Provide interval to date_minor_breaks = for vertical lines between date labels
    • If your incidence intervals are Sunday weeks, it is more complex - see the tab for a Sunday week example
      • Provide a sequence of Sunday dates to breaks = and to minor_breaks =
    • Use date_labels = for formatting (see Dates page for tips)
    • Add the argument expand = c(0,0) to start labels at the first incidence bar. Otherwise, first label will shift depending on your specified label interval.

*Note: if using aggregated counts (for example an epiweek x-axis) your x-axis may not be Date class and may require use scale_x_discrete() instead of scale_x_date() - see ggplot tips page for more details.

# Break modification using scale_x_date() only
##############################################
# make incidence object
i <- incidence(central_data$date_onset, interval = "Monday week")

# plot
plot(i)+
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")  # date labels format 

A Sunday week example

If you want a plot of Sunday weeks and also finely-adjusted label formats, you might find a code example helpful.
Here is an example of producing a weekly epicurve using incidence for Sunday weeks, with finely-adjusted date labels through scale_x_date():

# load packages
pacman::p_load(tidyverse,  # for ggplot
               incidence,  # for epicurve
               lubridate)  # for floor_date() and ceiling_date()

# create incidence object (specifying SUNDAY weeks)
central_outbreak <- incidence(central_data$date_onset, interval = "Sunday week") # equivalent to "MMWRweek" (see US CDC)

# plot() the incidence object
plot(central_outbreak)+                  
  
  ### ggplot() commands added to the plot
  # scale modifications 
  scale_x_date(
    expand = c(0,0),                 # remove excess x-axis space below and after case bars
    
    # sequence by 3 weeks, from Sunday before first case to Sunday after last case
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "3 weeks"),
    
    # sequence by week, from Sunday before first case to Sunday after last case
    minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                            to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                            by   = "7 days"),
    # date labels
    date_labels = "%d\n%b'%y")+       # adjust how dates are displayed
  
  scale_y_continuous(
    expand = c(0,0),                  # remove excess space under x-axis
    breaks = seq(0, 30, 5))+          # adjust y-axis intervals
  
  # Aesthetic themes
  theme_minimal()+                    # simplify background
  theme(
    axis.title = element_text(size = 12, face = "bold"),       # axis titles formatting
    plot.caption = element_text(face = "italic", hjust = 0))+  # caption formatting, left-aligned
  
  # Plot labels
  labs(x = "Week of symptom onset (Sunday weeks)", 
       y = "Weekly case incidence", 
       title = "Weekly case incidence at Central Hospital",
       #subtitle = "",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Show individual cases

Show individual cases

To show boxes around each individual case, use the argument show_cases = TRUE in the plot() function.

Boxes around each case can be more reader-friendly, if the outbreak is of a small size. Boxes can be applied when the interval is days, weeks, or any other time period. The code below creates the weekly epicurve for a smaller outbreak (only cases from Central Hospital), with boxes around each case.

# create filtered dataset for Central Hospital
central_data  <- linelist %>% 
  filter(hospital == "Central Hospital")

# create incidence object (weekly)
central_outbreak <- incidence(central_data$date_onset, interval = "Monday week")

# plot outbreak
plot(central_outbreak,
     show_cases = T)                 # show boxes around individual cases

The same epicurve showing individual cases, but with other aesthetic modifications:

# add plot() arguments and ggplot() commands
plot(central_outbreak,
     show_cases = T,                 # show boxes around each individual case
     color = "lightblue",            # color inside boxes
     border = "darkblue",            # color of border around boxes
     alpha = 0.5)+                    # transparency
  
  ### ggplot() commands added to the plot
  # scale modifications
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space below and after case bars
    date_breaks       = "4 weeks",      # labels appear every 4 Monday weeks
    date_minor_breaks = "week",         # vertical lines appear every Monday week
    date_labels       = "%d\n%b'%y")+   # date labels format 
  
  scale_y_continuous(
    expand = c(0,0),              # remove excess space under x-axis
    breaks = seq(0, 35, 5))+      # adjust y-axis intervals
  
  # aesthetic themes
  theme_minimal()+                                                 # simplify background
  
  theme(
    axis.title = element_text(size = 12, face = "bold"),       # axis title format
    plot.caption = element_text(face = "italic", hjust = 0))+  # caption format and left-align
  
  # plot labels
  labs(x = "Week of symptom onset (Monday weeks)", 
       y = "Weekly reported cases", 
       title = "Weekly case incidence at Central Hospital",
       #subtitle = "",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Color by value

Color by value

To color the cases by value, provide the column to the groups = argument in the incidence() command. In the example below the cases are colored by their age category. Note the use of incidence() argument na_as_group =. If TRUE (by default) missing values (NA) will form their own group.

# Create incidence object, with data grouped by age category
age_outbreak <- incidence(linelist$date_onset,            # date of onset for x-axis
                               interval = "week",         # weekly aggregation of cases
                               groups = linelist$age_cat, # color by age_cat value
                               na_as_group = TRUE)        # missing values assigned their own group


# plot the epicurve
plot(age_outbreak) 

Adjusting order

To adjust the order of group appearance (on plot and in legend), the group column must be class Factor. Adjust the order by adjusting the order of the levels (including NA). Below is an example with gender groups using data from Central Hospital only.

  • First, the dataset is defined and gender is re-defined as a factor
  • The order of levels of gender are defined with NA first, so it appears on the top of the bars
  • More appropriate labels are defined for each factor level - these appear in the legend
  • The argument exclude = NULL in factor() is necessary to adjust the order of NA, which is excluded by default.
  • Title of legend adjusted using fill = in labs()

You can read more about factors in their page (LINK)

# Create incidence object, data grouped by gender
#################################################

# Classify "gender" column as factor
####################################
# with specific level order and labels, includin for missing values
central_data <- linelist %>% 
  filter(hospital == "Central Hospital") %>% 
  mutate(gender = factor(gender,
                         levels = c(NA, "f", "m"),
                         labels = c("Missing", "Female", "Male"),
                         exclude = NULL))

# Create incidence object, by gender
####################################
gender_outbreak_central <- incidence(central_data$date_onset, 
                                     interval = "week", 
                                     groups = central_data$gender,
                                     na_as_group = TRUE)   # Missing values assigned their own group

# plot epicurve with modifications
##################################
plot(gender_outbreak_central,
     show_cases = TRUE)+                            # show box around each case
     
     ### ggplot commands added to plot
     # scale modifications
     scale_x_date(expand = c(0,0),
                  date_breaks = "6 weeks",
                  date_minor_breaks = "week",
                  date_labels = "%d %b\n%Y")+
  
     # aesthetic themes
     theme_minimal()+                               # simplify plot background
     theme(
       legend.title = element_text(size = 14, face = "bold"),
       axis.title = element_text(face = "bold"))+   # axis title bold
     
      # plot labels
      labs(fill = "Gender",                         # title of legend
           title = "Show case boxes, with modifications",
           y = "Weekly case incidence",
           x = "Week of symptom onset")      

Change colors and legend

Change colors and legend

To change the legend
Use ggplot() commands such as:

  • theme(legend.position = "top") (or “bottom”, “left”, “right”)
  • theme(legend.direction = "horizontal")
  • theme(legend.title = element_blank()) to have no title

See the page of ggplot() tips for more details on legends.

To specify colors manually, provide the name of the color or a character vector of multiple colors to the argument color =. Note to function properly the number of colors listed must equal the number of groups (be aware of missing values as a group)

# weekly outbreak by hospital
hosp_outbreak <- incidence(linelist$date_onset, 
                               interval = "week", 
                               groups = linelist$hospital,
                               na_as_group = FALSE)   # Missing values not assigned their own group
# default colors
plot(hosp_outbreak)

# manual colors
plot(hosp_outbreak, color = c("darkgreen", "darkblue", "purple", "grey", "yellow", "orange"))

To change the color palette
Use the argument col_pal in plot() to change the color palette to one of the default base R palettes (do not put the name of the palette in quotes).

Other palettes include TO DO add page with palette names… To DO

# Create incidence object, with data grouped by age category
age_outbreak <- incidence(linelist$date_onset,            # date of onset for x-axis
                               interval = "week",         # weekly aggregation of cases
                               groups = linelist$age_cat, # color by age_cat value
                               na_as_group = TRUE)        # missing values assigned their own group

# plot the epicurve
plot(age_outbreak)

# plot with different color palette
plot(age_outbreak, col_pal = rainbow)

Facets/small multiples

Facets/small multiples

To facet the plot by a variable (make “small multiples”), see the tab on epicurves with ggplot()

ggplot()

ggplot()

Below are tabs on using the ggplot2 package to produce epicurves from a linelist dataset.

Unlike using incidence package, you must manually control the aggregation of the data (into weeks, months, etc) and the labels on the date axis. If not carefully managed, this can lead to many headaches.

These tabs use a subset of the linelist dataset - only the cases from Central Hospital.

central_data <- linelist %>% 
  filter(hospital == "Central Hospital")

Intro

Intro

To produce an epicurve with ggplot() there are three main elements:

  • A histogram, to aggregate the linelisted cases into “bins” and display bars of the counts per bin (potentially by grouped values)
  • Scales for the axes and their associated labels (see tab on modifications)
  • Aesthetic themes for the plot, including titles, labels, captions, etc.

Below is perhaps the most simple code to produce daily and weekly epicurves. Axis scales and labels use default options.

# daily 
ggplot(data = central_data, aes(x = date_onset)) +  # x column must be class Date
  geom_histogram(binwidth = 1)+                     # date values binned by 1 day 
  labs(title = "Daily")

# weekly
ggplot(data = central_data, aes(x = date_onset)) +  
  geom_histogram(binwidth = 7)+                     # date values binned each 7 days (arbitrary 7 days!) 
  labs(title = "Weekly")

CAUTION: Using binwidth = 7 starts the first bin at the first case, which could be any day of the week! To create specific Monday or Sunday weeks, see below .

To create weekly epicurves where the bins begin on a specific day of the week (e.g. Monday, Sunday), specify the histogram breaks = manually (not binwidth). This can be done by creating a sequence of dates using seq.Date() from base R. You can start/end the sequence at a specific date (as.Date("YYYY-MM-DD"), or write flexible code to begin the sequence at a specific day of the week before the first case. An example of creating such weekly breaks is below:

seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
         to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
         by   = "7 days")

To achieve the “from” value (earliest date of the sequence), the minimum value in the column date_onset is fed to floor_date() from the lubridate package, which according to the above specified arguments produces the start date of that “week”, given that the start of each week is a Monday (week_start = 1). Likewise, the “to” value (end date of the sequence) is specified using the inverse ceiling_date() function to produce the Monday after the last case. The “by” argument can be set to any length of days, weeks, or months.

This code is applied to create the histogram breaks, and also the breaks for the date labels. Read more about the date labels in the Modifications tab. Defining your breaks like above will be necessary if your weekly bins are not by Monday weeks.

Below is detailed code to produce weekly epicurves for Monday and Sunday weeks. See the tab on Modifications (axes) to learn the nuances of date-axis label management.

Monday weeks

Of note:

  • The break points of the histogram bins are specified manually to begin the Monday (week_start = 1) before the earliest case and to end the Monday after the last case (see explanation above).
  • The breaks for date labels on x-axis - because the bins are Monday weeks this code uses date_breaks = within scale_x_date(), which also uses Monday weeks. Sunday weeks use a different method.
  • Minor vertical gridlines between date labels are made using date_minor_breaks = within scale_x_date(), again because this plot is for Monday weeks. Sunday weeks use a different method.
  • Adding expand = c(0,0) to the x and y scales removes excess space on each side of the plot, which also ensures the labels begin at the first bar.
  • Color and fill are defined in geom_histogram()
# TOTAL MONDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) + 
  # make histogram: specify bin break points: starts the Monday before first case, end Monday after last case
  geom_histogram(
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
                      by   = "7 days"), # bins are 7-days
    color = "darkblue",   # color of lines around bars
    fill = "lightblue") + # color of fill within bars
  
  # x-axis labels
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+             # remove excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases (Monday weeks)",
       subtitle = "Subtitle: Note alignment of bars, vertical lines, and axis labels on Mondays",
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Sunday weeks

The below code creates a histogram of the rows, using a date column as the x-axis. Of note:

  • The break points of the histogram bins are specified manually to begin the Sunday (week_start = 7) before the earliest case and to end the Monday after the last case (see explanation above).
  • The breaks for date labels on the x-axis and vertical gridlines - because the bins are not Monday weeks, manually specified vectors of dates are given to breaks = and minor_breaks = within scale_x_date(). You cannot use the scale_x_date() arguments of date_breaks and date_minor_breaks as these align with Monday weeks.
  • Adding expand = c(0,0) to the x and y scales removes excess space on each side of the plot, which also ensures the labels begin at the first bar.
  • Color and fill are defined in geom_histogram()
# TOTAL SUNDAY WEEK ALIGNMENT
#############################
ggplot(central_data, aes(x = date_onset)) + 
  
  # For histogram, manually specify bin break points: starts the Sunday before first case, end Sunday after last case
  geom_histogram(                    
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "7 days"), # bins are 7-days
    color = "darkblue",   # color of lines around bars
    fill = "lightblue") + # color of fill within bars
  
  # The labels on the x-axis
  scale_x_date(expand = c(0,0),
               breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                 to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                 by   = "3 weeks"),
               minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                                       to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                                       by   = "7 days"),
               date_labels = "%d\n%b\n'%y")+             # day, above month abbrev., above 2-digit year
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases (Sunday weeks)",
       subtitle = "Subtitle: Note alignment of bars, vertical lines, and axis labels on Sundays",
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Modifications

Modifications

Modify axes

Modify axes

TIP: Remember that date-axis labels are independent from the aggregation of the data into bars

To modify the aggregation of data into bins/bars, do one of the following:

  • Specify a binwidth = within geom_histogram() - for a column of class Date, the given number is interpreted in days
  • Specify breaks = as a sequence of bin break-point dates
  • Group the rows into aggregated counts (by week, month, etc.) and feed the aggregated counts to ggplot(). See the tab on aggregated counts for more information.

To modify the date labels, use scale_x_date() in one of these ways:

  • If your histogram bins are days, Monday weeks, months, or years:
    • Use date_breaks = to specify label frequency (e.g. “day”, “week”, “3 weeks”, “month”, or “year”)
    • Use date_minor_breaks = to specify frequency of minor vertical gridlines between date labels
    • Add expand = c(0,0) to begin the labels at the first bar (otherwise, first label will shift forward depending on specified frequency)
    • Use date_labels = to specify format of date labels - see the Dates page for tips (use \n for a new line)
  • If your histogram bins are Sunday weeks:
    • Use breaks = and minor_breaks = by providing a sequence of dates for breaks
    • You can still use date_labels = for formatting as described above

To create a sequence of dates
You can use seq.Date() from base R. You can start/end the sequence at a specific date (as.Date("YYYY-MM-DD"), or write flexible code to begin the sequence at a specific day of the week before the first case. An example of creating such flexible breaks is below:

seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
         to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
         by   = "7 days")

To achieve the “from” value (earliest date of the sequence), the minimum value in the column date_onset is fed to floor_date() from the lubridate package, which according to the above specified arguments produces the start date of that “week”, given that the start of each week is a Monday (week_start = 1). Likewise, the “to” value (end date of the sequence) is specified using the inverse ceiling_date() function to produce the Monday after the last case. The “by” argument can be set to any length of days, weeks, or months.

If using aggregated counts (for example an epiweek x-axis) your x-axis may not be Date class and may require use scale_x_discrete() instead of scale_x_date() - see ggplot tips page for more details.

Set maximum and minimum date values using limits = c() within scale_x_date(). E.g. scale_x_date(limits = c(as.Date("2014-04-01), NA)) sets a minimum but leaves the maximum open.

CAUTION: Caution using limits! They remove all data outside the limits, which can impact y-axis max/min, modeling, and other statistics. Strongly consider instead using limits by adding coord_cartesian() to your plot, which acts as a “zoom” without removing data.

DANGER: Be cautious setting the y-axis scale breaks (e.g. 0 to 30 by 5: seq(0, 30, 5)). Static numbers can cut-off your data if the data changes!.

https://rdrr.io/r/base/strptime.html —– see all % shortcuts

Below is a demonstration of some plots where the bins and the plot labels/gridlines are aligned and not aligned:
Click “Code” to see the code

# 7-day binwidth defaults
#################
ggplot(central_data, aes(x = date_onset)) + # x column must be class Date
  geom_histogram(
    binwidth = 7,                       # 7 days per bin (! starts at first case!)
    color = "darkblue",                 # color of lines around bars
    fill = "lightblue") +               # color of bar fill
  
  labs(
    title = "MISALIGNED",
    subtitle = "!CAUTION: 7-day bars start Thursdays with first case\ndefault axis labels/ticks not aligned")


# 7-day bins + Monday labels
#############################
ggplot(central_data, aes(x = date_onset)) +
  geom_histogram(
    binwidth = 7,                 # 7-day bins with start at first case
    color = "darkblue",
    fill = "lightblue") +
  
  scale_x_date(
    expand = c(0,0),               # remove excess x-axis space below and after case bars
    date_breaks = "3 weeks",       # Monday every 3 weeks
    date_minor_breaks = "week",    # Monday weeks
    date_labels = "%d\n%b\n'%y")+  # label format
  
  scale_y_continuous(
    expand = c(0,0))+              # remove excess space under x-axis, make flush with labels
  
  labs(
    title = "MISALIGNED",
    subtitle = "!CAUTION: 7-day bars start Thursdays with first case\nDate labels and gridlines on Mondays")



# 7-day bins + Months
#####################
ggplot(central_data, aes(x = date_onset)) +
  geom_histogram(
    binwidth = 7,
    color = "darkblue",
    fill = "lightblue") +
  
  scale_x_date(
    expand = c(0,0),                 # remove excess x-axis space below and after case bars
    date_breaks = "months",          # 1st of month
    date_minor_breaks = "week",      # Monday weeks
    date_labels = "%d\n%b\n'%y")+    # label format
  
  scale_y_continuous(
    expand = c(0,0))+                # remove excess space under x-axis, make flush with labels
  
  labs(
    title = "MISALIGNED",
    subtitle = "!CAUTION: 7-day bars start Thursdays with first case\nGridlines at 1st of each month (with labels) and weekly on Mondays\nLabels on 1st of each month")


# TOTAL MONDAY ALIGNMENT: specify manual bin breaks to be mondays
#################################################################
ggplot(central_data, aes(x = date_onset)) + 
  geom_histogram(
    # histogram breaks set to 7 days beginning Monday before first case
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") + 
  
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space below and after case bars
    date_breaks = "3 weeks",           # Monday every 3 weeks
    date_minor_breaks = "week",        # Monday weeks 
    date_labels = "%d\n%b\n'%y")+      # label format
  
  labs(
    title = "ALIGNED Mondays",
    subtitle = "7-day bins manually set to begin Monday before first case (28 Apr)\nDate labels and gridlines on Mondays as well")


# TOTAL SUNDAY ALIGNMENT: specify manual bin breaks AND labels to be Sundays
############################################################################
ggplot(central_data, aes(x = date_onset)) + 
  geom_histogram(
    # histogram breaks set to 7 days beginning Sunday before first case
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") + 
  
  scale_x_date(
    expand = c(0,0),
    # date label breaks set to every 3 weeks beginning Sunday before first case
    breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                      by   = "3 weeks"),
    # gridlines set to weekly beginning Sunday before first case
    minor_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 7)),
                            to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 7)),
                            by   = "7 days"),
    date_labels = "%d\n%b\n'%y")+  # label format
  
  labs(title = "ALIGNED Sundays",
       subtitle = "7-day bins manually set to begin Sunday before first case (27 Apr)\nDate labels and gridlines manually set to Sundays as well")



# Check values of bars by creating dataframe of grouped values
# central_tab <- central_data %>% 
#   mutate(week = aweek::date2week(date_onset, floor_day = TRUE, factor = TRUE)) %>% 
#   group_by(week, .drop=F) %>%
#   summarize(n = n()) %>% 
#   mutate(groups_3wk = 1:(nrow(central_tab)+1) %/% 3) %>% 
#   group_by(groups_3wk) %>% 
#   summarize(n = n())

Color by groups

Color by groups

Designate a column containing groups

In any of the code template (Sunday weeks, Monday weeks), make the following changes:

  • Add the aesthetics argument aes() within the geom_histogram() (don’t forget comma afterward)
  • Within aes(), provide the grouping column name to group = and fill = (no quotes needed). group is necessary, while fill changes the color of the bar.
  • Remove any fill = argument outside of the aes(), as it will override the one inside
  • Arguments inside aes() will apply by group, whereas any outside will apply to all bars (e.g. you may want color = outside, so each bar has the same color perimeter/border)
geom_histogram(
    aes(group = gender, fill = gender))

Adjust colors:

  • To manually adjust the bar fill color of each group, use scale_fill_manual() (note scale_color_manual() is different!).
    • Use the values = argument to apply a vector of colors.
    • Use na.value = to specify a color for missing values.
    • ! While you can use the labels = argument in scale_fill_manual() change the legend text labels - it is easy to accidentally give labels in the incorrect order and have an incorrect legend! It is recommended to instead convert the group column to class Factor and designate factor labels and order, as explained below.
  • To adjust the colors via a color scale, see the page on ggplot tips

Adjust the stacking order and Legend

Stacking order, and the labels for each group in the legend, is best adjusted by classifying the group column as class Factor. You can then designate the levels and their labels, and the order (which is reflected in stack order).

Step 1: Before making the ggplot, convert the group column to class Factor using factor() from base R.
For example, with a column “gender” with values “m” and “f” and NA, this can be put in a mutate() command as:

dataset <- dataset %>% 
  mutate(gender = factor(gender,
                    levels = c(NA, "f", "m"),
                    labels = c("Missing", "Female", "Male"),
                    exclude = NULL))

The above code establishes the levels, in the ordering that missing values are “first” (and will appear on top). Then the labels that will show are given in the same order. Lastly, the exclude statement ensures that NA is included in the ordering (by default factor() ignores NA).

Read more about factors in their dedicated handbook page (LINK).

Adjusting the legend

Read more about legends in the ggplot tips page. Here are a few highlights:

  • theme(legend.position = "top") (or “bottom”, “left”, “right”)
  • theme(legend.direction = "horizontal")
  • theme(legend.title = element_blank()) to have no title

See the page of ggplot() tips for more details on legends.

These steps are shown in the example below:

Click “Code” to see the code

########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(
  from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
  to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
  by   = "7 days") # bins are 7-days

# Set gender as factor and missing values as first level (to show on top)
central_data <- linelist %>%
  filter(hospital == "Central Hospital") %>% 
  mutate(gender = factor(
    gender,
    levels = c(NA, "f", "m"),
    labels = c("Missing", "Female", "Male"),
    exclude = NULL))  

# make plot
###########
ggplot(central_data, aes(x = date_onset)) + 
  geom_histogram(
    aes(group = gender, fill = gender),    # arguments inside aes() apply by group
    color = "black",                       # arguments outside aes() apply to all data
    breaks = bin_breaks)+                  # see breaks defined above
                      
  
  # The labels on the x-axis
  scale_x_date(
    expand            = c(0,0),         # remove excess x-axis space below and after case bars
    date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
    date_minor_breaks = "week",         # vertical lines appear every Monday week
    date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(
    expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  #scale of colors and legend labels
  scale_fill_manual(
    values = c("grey", "orange", "purple"))+ # specify fill colors ("values") - attention to order!

  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(
    plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
    axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(
    title    = "Weekly incidence of cases, by gender",
    subtitle = "Subtitle",
    fill     = "Gender",                                      # provide new title for legend
    x        = "Week of symptom onset",
    y        = "Weekly incident cases reported",
    caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Display bars side-by-side

Side-by-side display of group bars (as opposed to stacked) is specified within geom_histogram() with position = "dodge".
If there are more than two value groups, these can become difficult to read. Consider instead using a faceted plot (small multiples) (see tab). To improve readability in this example, missing gender values are removed.

Click “Code” to see the code

########################
# bin break points for histogram defined here for clarity
# starts the Monday before first case, end Monday after last case
bin_breaks = seq.Date(from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
                      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
                      by   = "7 days") # bins are 7-days

# New dataset without rows missing gender
central_data_dodge <- linelist %>% 
  filter(hospital == "Central Hospital") %>% 
  filter(!is.na(gender)) %>%                            # remove rows missing gender
  mutate(gender = factor(gender,                        # factor now has only two levels (missing not included)
                         levels = c("f", "m"),
                         labels = c("Female", "Male")))  

# make plot
###########
ggplot(central_data_dodge, aes(x = date_onset)) + 
    geom_histogram(
        aes(group = gender, fill = gender),    # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks,
        position = "dodge")+                  # see breaks defined above
                      
  
  # The labels on the x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "3 weeks",      # labels appear every 3 Monday weeks
               date_minor_breaks = "week",         # vertical lines appear every Monday week
               date_labels       = "%d\n%b\n'%y")+ # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  #scale of colors and legend labels
  scale_fill_manual(values = c("pink", "lightblue"))+     # specify fill colors ("values") - attention to order!

  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"))+               # axis titles in bold
  
  # labels
  labs(title    = "Weekly incidence of cases, by gender",
       subtitle = "Subtitle",
       fill     = "Gender",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Faceting/small-multiples

Faceting/small-multiples

As with other ggplots, you can create facetted plots (“small multiples”) off values in a column. As explained in the ggplot tips page of this handbook, you can use either:

  • facet_wrap()
  • facet_grid()

For epicurves, facet_wrap() is typically easiest as it is likely that you only need to facet on one column. The general syntax is facet_wrap(rows ~ cols), where to the left of the tilde (~) is the name of a column to be spread across the “rows” of the new plot, and to the right of the tilde is the name of a column to be spread across the “columns” of the new plot.

Most simply, just use one column name, to the right of the tilde: facet_wrap(~age_cat).

Free axes
You will need to decide whether the scales (scales =) of the axes for each facet are “fixed” to the same dimensions (default), or “free” (meaning they will change based on the data within the facet). You can also specify “free_x” or “free_y” to release in only one dimension.

Number of cols and rows
This can be specified with ncol = and nrow = within facet_wrap().

Order of panels
To change the order of appearance, change the underlying order of the levels of the factor column used to create the facets.

Aesthetics
Font size and face, strip color, etc. can be modified through theme() with arguments like:

  • strip.text = element_text() (size, colour, face, angle…)
  • strip.background = element_rect() (e.g. element_rect(fill=“red”))

The position of the strip can be modified as the strip.position = argument within facet_wrap() (e.g. “bottom”, “top”, “left”, “right”)

Strip labels
Labels of the facet plots can be modified through the “labels” of the column as a factor, or by the use of a “labeller”.

Make a labeller like this, using the function as_labeller() from ggplot2:

my_labels <- as_labeller(c(
     "0-4"   = "Ages 0-4",
     "5-9"   = "Ages 5-9",
     "10-14" = "Ages 10-14",
     "15-19" = "Ages 15-19",
     "20-29" = "Ages 20-29",
     "30-49" = "Ages 30-49",
     "50-69" = "Ages 50-69",
     "70+"   = "Over age 70"))

An example plot
Faceted by column age_cat. Click “Code” to see the code.

# make plot
###########
ggplot(central_data, aes(x = date_onset)) + 
  
  geom_histogram(
        aes(group = age_cat, fill = age_cat),    # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks)+                  # see breaks defined above
                      
    
  
  # The labels on the x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "2 months",     # labels appear every 2 months
               date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
               date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"),
        legend.position = "bottom",
        strip.text = element_text(face = "bold", size = 10),
        strip.background = element_rect(fill = "grey"))+               # axis titles in bold
  
  # create facets
  facet_wrap(~age_cat,
             ncol = 4,
             strip.position = "top",
             labeller = my_labels)+             
  
  # labels
  labs(title    = "Weekly incidence of cases, by age category",
       subtitle = "Subtitle",
       fill     = "Age category",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

See this link for more information on labellers.

Add total epidemic to background
Add a separate geom_histogram() command before the current one. Specify that the data used is the data without the column used for faceting (see select()). Then, specify a color like “grey” and a degree of transparency to make it appear in the background.

geom_histogram(data = select(central_data, -age_cat), color = "grey", alpha = 0.5)+

Note that the y-axis maximum is now based on the height of the entire epidemic. Click “Code” to see the code.

ggplot(central_data, aes(x = date_onset)) + 
  
  # for background shadow of whole outbreak
  geom_histogram(data = select(central_data, -age_cat), color = "grey", alpha = 0.5)+

  # actual epicurves by group
  geom_histogram(
        aes(group = age_cat, fill = age_cat),  # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks)+                  # see breaks defined above
                      
  # Labels on x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "2 months",     # labels appear every 2 months
               date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
               date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"),
        legend.position = "bottom",
        strip.text = element_text(face = "bold", size = 10),
        strip.background = element_rect(fill = "white"))+               # axis titles in bold
  
  # create facets
  facet_wrap(~age_cat,                          # each plot is one value of age_cat
             ncol = 4,                          # number of columns
             strip.position = "top",            # position of the facet title/strip
             labeller = my_labels)+             # labeller defines above
  
  # labels
  labs(title    = "Weekly incidence of cases, by age category",
       subtitle = "Subtitle",
       fill     = "Age category",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Create one facet with ALL data
To do this, you duplicate all the data (double the number of rows in the dataset) and in the faceted column have a new value (e.g. “all”) which indicates all the duplicated rows. A helped function is below that enables this:

# Define helper function
CreateAllFacet <- function(df, col){
     df$facet <- df[[col]]
     temp <- df
     temp$facet <- "all"
     merged <-rbind(temp, df)
     
     # ensure the facet value is a factor
     merged[[col]] <- as.factor(merged[[col]])
     
     return(merged)
}

# Create dataset that is duplicated, to show "all zones" as another facet level
central_data2 <- CreateAllFacet(central_data, col = "age_cat") %>%
  mutate(facet = factor(facet,
                        levels = c("all", "0-4", "5-9", "10-14", "15-19", "20-29", "30-49", "50-69", "70+")))

# check
table(central_data2$facet, useNA = "always")
## 
##   all   0-4   5-9 10-14 15-19 20-29 30-49 50-69   70+  <NA> 
##   454    84    87    77    73    84    37     0     0    12

Notable changes to the ggplot command are:

  • The data used is now central_data2 (double the rows, with new column “facet”)
  • Labeller will need to be updated, if used
  • To achieve long/thin plot, facet variable moved to rows side of equation, replaced by “.” facet_wrap(facet~.), and ncol = 1

You may also need to adjust the width and height of the save plot image (see ggsave()).

ggplot(central_data2, aes(x = date_onset)) + 
  
  # actual epicurves by group
  geom_histogram(
        aes(group = age_cat, fill = age_cat),  # arguments inside aes() apply by group
        color = "black",                       # arguments outside aes() apply to all data
        breaks = bin_breaks)+                  # see breaks defined above
                      
  # Labels on x-axis
  scale_x_date(expand            = c(0,0),         # remove excess x-axis space below and after case bars
               date_breaks       = "2 months",     # labels appear every 2 months
               date_minor_breaks = "1 month",      # vertical lines appear every 1 month 
               date_labels       = "%b\n'%y")+     # date labels format
  
  # y-axis
  scale_y_continuous(expand = c(0,0))+                   # removes excess y-axis space between bottom of bars and the labels
  
  # aesthetic themes
  theme_minimal()+                                               # a set of themes to simplify plot
  theme(plot.caption = element_text(face = "italic", hjust = 0), # caption on left side in italics
        axis.title = element_text(face = "bold"),
        legend.position = "bottom")+               
  
  # create facets
  facet_wrap(facet~. ,                            # each plot is one value of facet
             ncol = 1)+            

  # labels
  labs(title    = "Weekly incidence of cases, by age category",
       subtitle = "Subtitle",
       fill     = "Age category",                                      # provide new title for legend
       x        = "Week of symptom onset",
       y        = "Weekly incident cases reported",
       caption  = stringr::str_glue("n = {nrow(central_data)} from Central Hospital; Case onsets range from {format(min(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')} to {format(max(central_data$date_onset, na.rm=T), format = '%a %d %b %Y')}\n{nrow(central_data %>% filter(is.na(date_onset)))} cases missing date of onset and not shown"))

Moving averages

Moving averages

Add a moving averages to a ggplot() epicurve in one of two ways:

  1. Plot the pre-calculated moving average:
    • Aggregate the data as necessary (daily, weekly, etc.)
    • Calculate the moving average
    • Add the moving average to the ggplot (e.g. with geom_line())
  2. Calculate on-the-fly within the ggplot() command
Using slider

In this approach, the moving average is calculated in the dataset prior to plotting:

  • Within mutate(), a new column is created to hold the average. slide_index() from slider package is used as shown below.
  • In the ggplot(), a geom_line() is added after the histogram, reflecting the moving average.

See the helpful online vignette for the slider package

pacman::p_load(slider)  # slider used to calculate rolling averages

# make dataset of daily counts and 7-day moving average
#######################################################
ll_counts_7day <- linelist %>% 
  ## count cases by date
  count(date_onset,
        name = "new_cases") %>%   # name of new column
  filter(!is.na(date_onset)) %>%  # remove cases with missing date_onset
  
  ## calculate the average number of cases in the preceding 7 days
  mutate(
    avg_7day = slider::slide_index(    # create new column
      new_cases,                       # calculate based on value in new_cases column
      .i = date_onset,                 # index is date_onset col, so non-present dates are included in window 
      .f = ~mean(.x, na.rm = TRUE),    # function is mean() with missing values removed
      .before = 6,                     # window is the day and 6-days before
      .complete = FALSE),              # must be FALSE for unlist() to work in next step
    avg_7day = unlist(avg_7day))


# plot
######
ggplot(data = ll_counts_7day, aes(x = date_onset)) +
    geom_histogram(aes(y = new_cases),
                   fill="#92a8d1",
                   stat = "identity",
                   position = "stack",
                   colour = "#92a8d1")+ 
    geom_line(aes(y = avg_7day, lty = "7-day \nrolling avg"),
              color="red",
              size = 1) + 
    scale_x_date(date_breaks = "1 month",
                 date_labels = '%d/%m',
                 expand = c(0,0)) +
    scale_y_continuous(expand = c(0,0),
                       limits = c(0, NA)) + 
    labs(x="",
         y ="Number of confirmed cases",
         fill = "Legend")+ 
    theme_minimal()+
    theme(legend.title = element_blank())  # removes title of legend

Using tidyquant

Using the tidyquant package to calculate the moving average on-the-fly (within ggplot()).

This option is more difficult to modify than pre-calculating the moving average. By default,geom_ma() uses the Simple Moving Average (SMA) (TRR::SMA()). See documentation by entering ?SMA in your R console. Calculates the arithmatic mean over the past n observations. Also note how the moving average does not begin as early as the previous example.

library(tidyquant)

# make daily count data
#######################
ll_counts_7day <- linelist %>% 
  count(date_onset, name = "daily_cases")


# plot
######
ggplot(data = ll_counts_7day,   # use daily count data
       aes(x = date_onset,      # date x-axis
           y = daily_cases))+   # counts
  
  # histogram in the background
  geom_histogram(stat = "identity",    # height = value in the cell, not number of rows
                 color = "#92a8d1",    # color of lines within histogram
                 fill = "#92a8d1")+    # color of histogram
  
  # moving average line
  tidyquant::geom_ma(n = 7,            # window width
                     size = 2,         # line size
                     color = "black",  # line color
                     lty = "solid"     # line type ()
                     )+
     
  # labels for x-axis
  scale_x_date(date_breaks = "2 months",      # labels every 2 months 
               date_minor_breaks = "1 month", # gridlines every month
               date_labels = '%b\n%Y')+       #labeled by month with year below
     
  # Choose color palette (uses RColorBrewer package)
  scale_fill_brewer(palette = "Pastel2")+ 
  
  theme_minimal()+
  
  labs(x = "Date of onset", 
       y = "Daily case incidence",
       title = "Daily case incidence, with 7-day moving average")

Tentative data

Tentative data

The most recent data shown in epicurves should often be marked as tentative, or subject to reporting delays. This can be done in by adding a vertical line and/or rectangle over a specified number of days. Here are two options:

  1. Use annotate():
  • Pros: Transparency of rectangle is easy. Cons: Items will not appear in legend.
  • For a line use annotate(geom = "segment"). Provide x, xend, y, and yend. Adjust size, linetype (lty), and color.
  • For a rectangle use annotate(geom = "rect"). Provide xmin/xmax/ymin/ymax. Adjust color and alpha.
  1. Use geom_segment() and geom_rect():
  • Pros: Items can easily appear in legend. Cons: Difficult to achieve semi-transparency of rectangle.
  • Provide the same x/y arguments as noted above for annotate()

CAUTION: While you can use geom_rect() to draw a rectangle, adjusting the transparency (alpha) does not work in a linelist context. This function overlays a rectangle for each observation/row!. Try a very low alpha (e.g. 0.01), or use annotate(geom = "rect") as shown.

Using annotate()
  • Within annotate(geom = "rect"), the xmin and xmax arguments must be given inputs of class Date.
  • Note that because these data are aggregated into weekly bars, and the last bar extends to the Monday after the last data point, the shaded region may appear to cover 4 weeks
  • annotate() online example
ggplot(central_data, aes(x = date_onset)) + 
  
  # histogram
  geom_histogram(
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") +

  # scales
  scale_y_continuous(expand = c(0,0))+
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space below and after case bars
    date_breaks = "1 month",           # 1st of month
    date_minor_breaks = "1 month",     # 1st of month
    date_labels = "%b\n'%y")+          # label format
  
  # labels and theme
  labs(title = "Using annotate()\nRectangle and line showing that data from last 21-days are tentative",
    x = "Week of symptom onset",
    y = "Weekly case indicence")+ 
  theme_minimal()+
  
  # add semi-transparent red rectangle to tentative data
  annotate("rect",
           xmin  = as.Date(max(central_data$date_onset, na.rm = T) - 21), # note must be wrapped in as.Date()
           xmax  = as.Date(Inf),                                          # note must be wrapped in as.Date()
           ymin  = 0,
           ymax  = Inf,
           alpha = 0.2,          # alpha easy and intuitive to adjust using annotate()
           fill  = "red")+
  
  # add black vertical line on top of other layers
  annotate("segment",
           x     = max(central_data$date_onset, na.rm = T) - 21, # 21 days before last data
           xend  = max(central_data$date_onset, na.rm = T) - 21, 
           y     = 0,         # line begins at y = 0
           yend  = Inf,       # line to top of plot
           size  = 2,         # line size
           color = "black",
           lty   = "solid")+   # linetype e.g. "solid", "dashed"

  # add text in rectangle
  annotate("text",
           x = max(central_data$date_onset, na.rm = T) - 15,
           y = 20,
           label = "Subject to reporting delays",
           angle = 90)

The same black vertical line can be achieved with the code below, but using geom_vline() you lose the ability to control the height:

geom_vline(xintercept = max(central_data$date_onset, na.rm = T) - 21,
           size = 2,
           color = "black")
Using geom_segment() and geom_rect()
ggplot(central_data, aes(x = date_onset)) + 
  
  # histogram
  geom_histogram(
    breaks = seq.Date(
      from = as.Date(floor_date(min(central_data$date_onset, na.rm=T),   "week", week_start = 1)),
      to   = as.Date(ceiling_date(max(central_data$date_onset, na.rm=T), "week", week_start = 1)),
      by   = "7 days"),
    color = "darkblue",
    fill = "lightblue") +

  # scales
  scale_y_continuous(expand = c(0,0))+
  scale_x_date(
    expand = c(0,0),                   # remove excess x-axis space below and after case bars
    date_breaks = "3 weeks",           # Monday every 3 weeks
    date_minor_breaks = "week",        # Monday weeks 
    date_labels = "%d\n%b\n'%y")+      # label format
  
  # labels and theme
  labs(title = "Using geom_segment() and geom_rect()\nRectangle and line showing that data from last 21-days are tentative",
    subtitle = "")+ 
  theme_minimal()+
  
  # make rectangle covering last 21 days
  geom_rect(aes(
              xmin  = as.Date(max(central_data$date_onset, na.rm = T) - 21), # note must be wrapped in as.Date()
              xmax  = as.Date(Inf),                                          # note must be wrapped in as.Date()
              ymin  = 0,
              ymax  = Inf,
              color = "Reporting delays\npossible"),    # sets label for legend (note: is within aes())
              alpha = .002,                             # !!! Difficult to adjust transparency with this option
           fill  = "red")+
  
  # make vertical line
  geom_segment(aes(x = max(central_data$date_onset, na.rm = T) - 21,
                   xend = max(central_data$date_onset, na.rm = T) - 21,
                   y = 0,
                   yend = Inf),
               color = "black",
               lty = "solid",
               size = 2)+
  theme(legend.title = element_blank())                 # remove title of legend

Dual axis

Dual axis

Two axes

Multi-level date labels

Multi-level date labels

Here is an option if you want multi-level date labels, without duplicating the lower label levels (e.g. for year or month).

Remember, you can can use tools like \n within the date_labels or labels arguments to put parts of each label on a new line below. However, the code below helps you take years or months (for example) on a lower line and only once.

A few notes on the code below:

  • Case counts are aggregated into weeks for aesthetic reasons. See Epicurves page (aggregated data tab) for details.
  • A line is used instead of a histogram, as the faceting approach below does not work well with histograms.

Aggreagate the weekly counts


# Create dataset of case counts by week
#######################################
central_weekly <- linelist %>%
  filter(hospital == "Central Hospital") %>%           # filter linelist
  mutate(week = lubridate::floor_date(date_onset, unit = "weeks")) %>%  
  count(week, .drop=F) %>%                             # summarize weekly case counts
  filter(!is.na(week)) %>%                             # remove cases with missing onset_date
  complete(week = seq.Date(from = min(week),           # fill-in all weeks with no cases reported
                           to   = max(week),
                           by   = "week"))

Make plots

# plot
######
ggplot(central_weekly) +
  geom_line(aes(x = week, y = n),    # make line, specify x and y
            stat = "identity") +             # because line height is count number
  scale_x_date(date_labels="%b",             # date label format show month 
               date_breaks="month",          # date labels on 1st of each month
               expand=c(0,0)) +              # remove excess space
  facet_grid(~lubridate::year(week), # facet on year (of Date class column)
             space="free_x",                
             scales="free_x",                # x-axes adapt to data range (not "fixed")
             switch="x") +                   # facet labels (year) on bottom
  theme_bw() +
  theme(strip.placement = "outside",         # facet labels placement
        strip.background = element_rect(fill = NA, # facet labels no fill grey border
                                        colour = "grey50"),
        panel.spacing = unit(0, "cm"))+      # no space between facet panels
  labs(title = "Nested year labels, grey label border")



# plot no border
################
ggplot(central_weekly,
       aes(x = week, y = n)) +              # establish x and y for entire plot
  geom_line(stat = "identity",              # make line, line height is count number
            color = "#69b3a2") +            # line color
  geom_point(size=1, color="#69b3a2") +     # make points at the weekly data points
  geom_area(fill = "#69b3a2",               # fill area below line
            alpha = 0.4)+                   # fill transparency
  scale_x_date(date_labels="%b",            # date label format show month 
               date_breaks="month",         # date labels on 1st of each month
               expand=c(0,0)) +             # remove excess space
  facet_grid(~lubridate::year(week),   # facet on year (of Date class column)
             space="free_x",                
             scales="free_x",               # x-axes adapt to data range (not "fixed")
             switch="x") +                  # facet labels (year) on bottom
  theme_bw() +
  theme(strip.placement = "outside",                     # facet label placement
          strip.background = element_blank(),            # no facet lable background
          panel.grid.minor.x = element_blank(),          
          panel.border = element_rect(colour="grey40"),  # grey border to facet PANEL
          panel.spacing=unit(0,"cm"))+                   # No space between facet panels
  labs(title = "Nested year labels - points, shaded, no label border")

The above techniques were adapted from this and this post on stackoverflow.com.

Aggregating linelist data

Aggregating linelist data

To learn generally how to group and aggregate data, see the handbook page on Grouping/Aggregating.

In this circumstance, we demonstrate aggregating into weeks, months, and days.

Weeks

Create a new column that is weeks, then use group_by() with summarize() to get weekly case counts.

To aggregate into weeks and show ALL weeks (even ones with no cases), do this:

  1. Create a new ‘week’ column within mutate(), using floor_date() from the lubridate package:
    • use unit = to set the desired time unit, e.g. "week`
    • use week_start = to set the weekday start of the week (7 = Sunday, 1 = Monday)
  2. Follow with complete() to ensure that all weeks appear - even those with no cases.

For example:

# Make dataset of weekly case counts
weekly_counts <- linelist %>% 
  mutate(
    week = lubridate::floor_date(date_onset,
                                 unit = "week")) %>%  # new column of week of onset
  count(week) %>%                                     # group data by week and count rows per group
  filter(!is.na(week)) %>%                            # remove entries for cases missing date_onset
  complete(week = seq.Date(from = min(week),          # fill-in all weeks with no cases reported
                           to = max(week),
                           by="week")) %>% 
  ungroup()                                           # deactivate grouping

Here are the first 50 rows of the resulting dataframe:

Alternatively, you can use the aweek package’s date2week() function. As shown below, set week_start = to “Sunday”, or “Monday”, etc. Set floor_date = TRUE so the output is YYYY-Www. Set factor = TRUE so that all possible weeks are included, even if there are no cases (this replaces the complete() step in the lubridate approach above). You can also use numeric = TRUE if you want only the week number (note this will not distinguish between years).

# Make dataset of weekly case counts
weekly_counts <- linelist %>% 
  mutate(week = aweek::date2week(date_onset,          # new column of week of onset
                                 floor_day = T,       # show as weeks without weekday
                                 factor = TRUE)) %>%  # include all possible weeks
  count(week) %>% 
  ungroup()                                           # deactivate grouping

# Optional: add column of start DATE for each week - e.g. for ggplot() when date x-axis is expected
# note: add this step AFTER the above code, to ensure all weeks are present
weekly_counts <- weekly_counts %>% 
  mutate(week_as_date = aweek::week2date(week, week_start = "Monday")) # output is Monday date of each week
Months

To aggregate cases into months, again use floor_date() from the lubridate package, but with the argument unit = "months". This rounds each date down to the 1st of its month. The output will be class Date.

Note that in the complete() step we also use “months”

# Make dataset of weekly case counts
monthly_counts <- linelist %>% 
  mutate(month = lubridate::floor_date(date_onset, unit = "months")) %>%   # new column, 1st of month of onset
  count(month) %>% 
  filter(!is.na(month)) %>% 
  complete(month = seq.Date(min(month),     # fill-in all months with no cases reported
                            max(month),
                            by="month"))    
Days

To aggregate a linelist into days, use the same approach but there is no need to create a new column. Use group_by() on the date column (e.g. date_onset).

If plotting a histogram, missing days in the data are not a problem as long as the column is class Date. However, it may be important for other types of plots or tables to have all possible days apear in the data. This is done with: tidyr::complete()

# Make dataset of weekly case counts
daily_counts <- linelist %>% 
  count(date_onset) %>%                           # count number of rows per unique date
  filter(!is.na(date_onset)) %>%                  # remove aggregation of rows that were missing date_onset
  complete(date_onset = seq.Date(min(date_onset), # ensure all days appear
                                 max(date_onset),
                                 by="day"))  

Aggregated data

Aggregated data

Often instead of a linelist, you begin with aggregated counts from facilities, districts, etc. You can make an epicurve from ggplot() but the code will be slightly different. The incidence package does not support aggregate data.

This section will utilize the count_data dataset that was imported earlier, in the data preparation section. It is the linelist aggregated to day-hospital counts. The first 50 rows are displayed below.

As before, we must ensure date variables are correctly classified.

# Convert Date variable to Date class
class(count_data$date_hospitalisation)
## [1] "Date"

We can plot a daily epicurve from these data. Here are the differences:

  • Specify y = to the counts column within the primary aesthetics aes()
  • Use of stat = "identity" within geom_histogram() indicates that the y-values could be counts from the y = column in aes()
ggplot(data = count_data, aes(x = as.Date(date_hospitalisation), y = n_cases))+
     geom_histogram(stat = "identity")+
     labs(x = "Week of report", 
          y = "Number of cases",
          Title = "Daily case incidence, from daily count data")

aggregate further

To aggregated further, into weeks, we use the package lubridate and function floor_date(), as described above. Note that we use group_by() and summarize() in place of count() becase we need to sum() case counts instead of just counting the number of rows per group.

# Create weekly dataset with epiweek column
count_data_weekly <- count_data %>%
  mutate(epiweek = lubridate::floor_date(date_hospitalisation, "week")) %>% 
  group_by(hospital, epiweek, .drop=F) %>% 
  summarize(n_cases_weekly = sum(n_cases, na.rm=T))   

The first 50 rows of count_data are displayed below.

For the plotting we also specify the factor level order of hospital.

count_data_weekly <- count_data_weekly %>% 
  mutate(hospital = factor(hospital,
                           levels = c("Missing", "Port Hospital", "Military Hospital", "Central Hospital", "St. Mark's Maternity Hospital (SMMH)", "Other")))

Now plot by epiweek.

ggplot(data = count_data_weekly,
       aes(x = epiweek,
           y = n_cases_weekly,
           group = hospital,
           fill = hospital))+
  
  geom_histogram(stat = "identity")+
     
  # labels for x-axis
  scale_x_date(date_breaks = "2 months",      # labels every 2 months 
               date_minor_breaks = "1 month", # gridlines every month
               date_labels = '%b\n%Y')+       #labeled by month with year below
     
  # Choose color palette (uses RColorBrewer package)
  scale_fill_brewer(palette = "Pastel2")+ 
  
  theme_minimal()+
  
  labs(x = "Week of onset", 
       y = "Weekly case incidence",
       fill = "Hospital",
       title = "Weekly case incidence, from aggregated count data by hospital")

Dual-axis

Dual-axis

Although there are fierce discussions about the validity of this within the data visualization community, many supervisors want to see an epicurve or similar chart with a percent overlaid with a second axis.

In ggplot it is difficult to do this, except for the case where you are showing a line reflecting the proportion of a category shown in the bars below.

See the handbook page on ggplot tips for details on how to make a second axis.

Resources

Resources

Links to other online tutorials or resources.